ReneWind

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and it has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.

The objective is to build various classification models, tune them and find the best one that will help identify failures so that the generator could be repaired before failing/breaking and the overall maintenance cost of the generators can be brought down.

“1” in the target variable represents “failure” and “0” represents “no failure”.

The nature of predictions made by the classification model will translate as follows:

  1. True positive (TP) - a failure is predicted correctly, so the component is repaired before it breaks.
  2. False negative (FN) - a failure is missed, so the component breaks and has to be replaced.
  3. False positive (FP) - a failure is predicted where there is none, so an unnecessary inspection takes place.

So, the maintenance cost associated with the model would be:

Maintenance cost = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost), where TP, FN, and FP denote the counts of true positives, false negatives, and false positives respectively.

Since the objective is to reduce the maintenance cost, we want an evaluation metric that tracks this cost directly.

So, we will try to maximize the ratio of the minimum possible maintenance cost to the maintenance cost associated with the model. The minimum possible cost is achieved when every actual failure is predicted correctly, so only repair costs are incurred.

The value of this ratio lies between 0 and 1; it equals 1 only when the maintenance cost associated with the model equals the minimum possible maintenance cost.
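The cost-ratio metric described above can be sketched as a small function. The specific cost figures (`REPAIR`, `REPLACE`, `INSPECT`) are illustrative assumptions, not numbers from the project brief:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed relative costs for illustration only; the real figures are not
# given in the project description.
REPAIR, REPLACE, INSPECT = 15_000, 40_000, 5_000

def maintenance_cost(y_true, y_pred):
    """TP*(repair) + FN*(replacement) + FP*(inspection)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp * REPAIR + fn * REPLACE + fp * INSPECT

def cost_ratio(y_true, y_pred):
    """Minimum possible cost (every true failure caught and repaired)
    divided by the model's cost; 1.0 means a cost-optimal model."""
    minimum = int(np.sum(y_true)) * REPAIR
    return minimum / maintenance_cost(y_true, y_pred)

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])  # 2 TP, 1 FN, 1 FP
print(cost_ratio(y_true, y_pred))  # 45000 / 75000 = 0.6
```

A perfect model yields a ratio of exactly 1; every missed failure pulls the ratio down faster than a false alarm does, which is the behaviour we want.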

Data Description

Importing libraries

Loading Data

Check for any missing data

Check for any missing data on the test set
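The missing-value check is a standard pandas pattern. Since the real CSVs are confidential, the sketch below builds a tiny stand-in frame with the same column naming convention (V1, V2, ..., Target):

```python
import numpy as np
import pandas as pd

# Stand-in for the real (confidential) training data.
train = pd.DataFrame({
    "V1": [0.3, np.nan, 1.2, -0.5],
    "V2": [1.1, 0.4, np.nan, 0.9],
    "Target": [0, 1, 0, 0],
})

missing = train.isnull().sum()
print(missing[missing > 0])  # show only columns that actually have gaps
```

The same two lines are run on the test set to confirm the missingness pattern matches.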

Check the info of the data

Check for duplicates

Let's check for duplicate rows and remove any that we find.
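A minimal sketch of the duplicate check, again on a stand-in frame:

```python
import pandas as pd

# Stand-in data: rows 0 and 1 are identical on purpose.
df = pd.DataFrame({"V1": [0.3, 0.3, 1.2], "V2": [1.1, 1.1, 0.9]})

n_dupes = df.duplicated().sum()
print(f"duplicate rows: {n_dupes}")
if n_dupes:
    df = df.drop_duplicates().reset_index(drop=True)
```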

EDA and insights

Functions for plotting
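One common helper in notebooks like this is a combined histogram-plus-boxplot for univariate analysis. The sketch below is an assumed implementation of that pattern, not the notebook's own code:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

def histogram_boxplot(data, bins=30):
    """Boxplot on top, histogram below, sharing the x-axis."""
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}
    )
    ax_box.boxplot(data, vert=False)
    ax_hist.hist(data, bins=bins)
    ax_hist.axvline(np.mean(data), linestyle="--")  # mark the mean
    return fig

fig = histogram_boxplot(np.random.default_rng(1).normal(size=500))
```

Calling this once per predictor gives a quick view of skew and outliers across all 40 variables.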

Univariate Analysis

Bivariate Analysis

Data Pre-processing

Missing value Treatment

From the missing-value counts, only V1 and V2 contain missing data (46 and 39 values respectively in the training set), and the missingness pattern in V1 is independent of that in V2.

Since most of the variables have near-normal distributions, we can use a median imputer to replace the missing values.

Test data

Impute missing values on training and validation data

Impute missing values on test data
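A sketch of the imputation step with scikit-learn's `SimpleImputer`: the medians are learned from the training data only and then applied unchanged to validation and test data, avoiding leakage. The toy frames stand in for the real data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in frames for the confidential train/test splits.
X_train = pd.DataFrame({"V1": [1.0, np.nan, 3.0, 5.0],
                        "V2": [2.0, 4.0, np.nan, 4.0]})
X_test = pd.DataFrame({"V1": [np.nan, 2.0], "V2": [1.0, np.nan]})

imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

print(imputer.statistics_)  # the per-column medians learned from training data
```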

Model evaluation criterion

Three types of cost are associated with the problem:

  1. Replacement cost - False Negatives - Predicting no failure, while there will be a failure
  2. Inspection cost - False Positives - Predicting failure, while there is no failure
  3. Repair cost - True Positives - Predicting failure correctly

How can the overall cost be reduced? Since replacing a broken generator costs more than repairing or inspecting one, false negatives dominate the cost; maximizing recall on the failure class is therefore the priority.

Let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.

Defining scorer to be used for hyperparameter tuning
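A sketch of the two helpers and the scorer. The metric set shown (accuracy, recall, precision, F1) is an assumption about what the notebook tracks; recall is used for tuning because missed failures carry the highest cost:

```python
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)

def model_performance(y_true, y_pred):
    """Return the classification metrics tracked for each model."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Passed as scoring=... to GridSearchCV / RandomizedSearchCV below.
scorer = make_scorer(recall_score)

print(model_performance([1, 1, 0, 0], [1, 0, 0, 0]))
```

A companion function wrapping `sklearn.metrics.ConfusionMatrixDisplay` would cover the confusion-matrix side of the pair.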

Model Building with Original Data

Logistic Regression (with Sklearn library)

Decision Tree

Random Forest

Bagging classifier

Model Building - Boosting

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier
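The model-comparison loop behind these sections can be sketched as below. XGBoost is swapped for scikit-learn's `GradientBoostingClassifier` so the snippet has no extra dependency, and a synthetic imbalanced dataset stands in for the confidential one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: ~10% failures, roughly matching the real imbalance.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9],
                           random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
    print(f"{name:20s} mean CV recall = {scores.mean():.3f}")
```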

Models with Original Data

We can see that XGBoost gives the highest cross-validated recall, followed by Random Forest. Both are potential candidates for hyperparameter tuning.

Model Building with Oversampled Data
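Notebooks of this kind typically oversample with SMOTE from imbalanced-learn; the dependency-free sketch below uses random oversampling via `sklearn.utils.resample` instead, which illustrates the same idea of balancing the classes before training:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.array([1] * 10 + [0] * 90)  # ~10% failures, as in the real data

# Resample the minority (failure) class with replacement up to majority size.
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

X_over = np.vstack([X_maj, X_min_up])
y_over = np.hstack([np.zeros(len(X_maj)), np.ones(len(X_min_up))])
print(np.bincount(y_over.astype(int)))  # classes are now balanced
```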

Oversampled data shows the same pattern: XGBoost gives the highest cross-validated recall, followed by Random Forest, so both remain candidates for hyperparameter tuning.

Model Building with Undersampled Data
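Undersampling is the mirror image: the majority class is shrunk to the minority's size. Again `sklearn.utils.resample` stands in for imbalanced-learn's `RandomUnderSampler`:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = np.array([1] * 10 + [0] * 90)

# Sample the majority (no-failure) class down, without replacement.
X_min, X_maj = X[y == 1], X[y == 0]
X_maj_dn = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)

X_under = np.vstack([X_maj_dn, X_min])
y_under = np.hstack([np.zeros(len(X_maj_dn)), np.ones(len(X_min))])
print(np.bincount(y_under.astype(int)))  # balanced, but much smaller
```

The trade-off: undersampling discards most of the no-failure observations, which is why the undersampled models tend to validate worse than the oversampled ones here.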

We can see that XGBoost gives the highest cross-validated recall, followed by Random Forest.

Model Selection

| # | Model | Training score (%) | Validation score (%) |
|---|-------|-------------------:|---------------------:|
| 1 | Random Forest, original data | 71.54 | 72.93 |
| 2 | GBM, oversampled | 87.29 | 73.21 |
| 3 | XGBoost, undersampled | 84.88 | 75.15 |
| 4 | XGBoost, original data | 77.45 | 77.53 |
| 5 | Random Forest, oversampled | 97.20 | 80.84 |
| 6 | XGBoost, oversampled | 97.58 | 80.67 |

Hyperparameter Tuning

For XGBoost:

param_grid = {'n_estimators': np.arange(150, 300, 50), 'scale_pos_weight': [5, 10], 'learning_rate': [0.1, 0.2], 'gamma': [0, 3, 5], 'subsample': [0.8, 0.9]}

For Gradient Boosting:

param_grid = { "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)], "n_estimators": np.arange(75,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7]}

For Adaboost:

param_grid = { "n_estimators": np.arange(10, 110, 20), "learning_rate": [ 0.2, 0.05, 1], "base_estimator": [ DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1)]}

For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

For Bagging Classifier:

param_grid = { 'max_samples': [0.8,0.9], 'max_features': [0.8,0.9], 'n_estimators' : [40,50]}

For Random Forest:

param_grid = { "n_estimators": [150,250], "min_samples_leaf": np.arange(1, 3), "max_features": ['sqrt','log2'], "max_samples": np.arange(0.2, 0.6, 0.1)}

For Decision Trees:

param_grid = {'max_depth': np.arange(2,20), 'min_samples_leaf': [1, 2, 5, 7], 'max_leaf_nodes' : [5, 10,15], 'min_impurity_decrease': [0.0001,0.001] }

1- Random Forest - With Original Data - Tuning - GridSearchCV
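The grid-search step can be sketched as follows. The grid is a trimmed version of the Random Forest grid listed above (fewer values, to keep runtime small) and the dataset is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced stand-in for the confidential training data.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.85],
                           random_state=1)

# Trimmed grid for illustration; the full grid is given in the text above.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
    "max_samples": [0.5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                    scoring="recall", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

`RandomizedSearchCV` takes the same `param_grid` (as `param_distributions`) plus `n_iter`, sampling combinations instead of trying them all, which is what sections 2(b) and 3(b) compare.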

2(a)-Gradient Boost Classifier - With Oversampled Data - Tuning - GridSearchCV

2(b)-Gradient Boost Classifier - With Oversampled Data - Tuning - RandomizedSearchCV

3(a)- Xgboost Classifier - With Undersampled Data - Tuning - GridSearchCV

3(b)- Xgboost Classifier - With Undersampled Data - Tuning - RandomizedSearchCV

4- Xgboost with original data - Tuning - GridSearchCV

5- Random forest with over sampled data - Tuning

6- Xgboost with over sampled data - Tuning

Let's check the important features for the XGBoost model with original data.
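A sketch of the importance ranking. A Random Forest stands in for the tuned XGBoost model here, since both expose `feature_importances_` the same way, and synthetic data replaces the confidential set:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=1)
cols = [f"V{i + 1}" for i in range(X.shape[1])]  # V1..V8 naming convention

model = RandomForestClassifier(random_state=1).fit(X, y)
importances = pd.Series(model.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head())
```

On the real data this ranking is what surfaces V18 and V36 as the top predictors.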

Model Performance comparison and choosing the final model

Test set final performance

Pipelines to build the final model
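A minimal sketch of the final pipeline: bundling the median imputer and the classifier guarantees that inference-time data receives exactly the preprocessing the model was trained with. `GradientBoostingClassifier` stands in for the tuned final model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # same step as training
    ("model", GradientBoostingClassifier(random_state=1)),
])

# Tiny stand-in dataset with missing values, as the pipeline must handle.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]] * 10)
y = np.array([0, 1, 0, 1] * 10)
pipe.fit(X, y)
print(pipe.predict(X[:4]))
```

The fitted pipeline can then be serialized (e.g. with `joblib.dump`) and deployed as a single object.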

Business Insights and Conclusions

  1. Variables V18 and V36 are the most important features.
  2. Based on the model's performance on the test set, about 80% of failures can be identified (recall ≈ 0.80).
  3. Both models scoring above 0.78 on the test data exhibit overfitting, so real-time performance may be lower.
  4. Since the failure class is under-represented, collecting more data on failures could further improve the models.